simputation进行tidy化缺失值插补

R语言统计与绘图 2023-09-04

The following article is from Epi Science Author Epi Yu

❝
缺失值是常见的，通常分为3类缺失值的情形
完全随机缺失「Missing Completely at Random (MCAR)」，数据的缺失是完全随机，并不受任何人为或客观因素的干扰，这种情形在实际中几乎不存在或非常少见；
随机缺失「Missing at Random (MAR)」，数据的测定遭受了一定的系统和可预测的误差，比如说对于具有感冒症状的患者，通常而言他们具有咳嗽，头晕等症状，但体温可能不会测量，默认他们是处于发热状态，或者仅凭触感认为是发热，但实际没测量，这种情况下的缺失值是可以预测的，并可能通过其他数据进行预测和填补；
非随机缺失「Missing not at Random (MNAR)」，数据的缺失是由于非系统和可预测的方式造成的，这种情况下应尽可能记录缺失的原因，来推测可能存在的数据偏倚。
❞

加载R包

install.packages("simputation", dependencies = TRUE)
library(simputation)
library(tidyverse)

simputation提供了很多插补缺失值的方法，主要是以下插补方法：

基于模型的方法

线性回归
稳健线性回归
岭回归/弹性网络/lasso回归
CART模型（决策树）
随机森林
多元插补

基于最大期望值的方法

missForest
Donor imputation (including various donor pool specifications)

K最近邻法

sequential hotdeck (LOCF, NOCB)
random hotdeck
Predictive mean matching
其他

median imputation
Proxy imputation: 使用其他列的值或使用简单的转换得到的值.
Apply trained models for imputation purposes.

使用方法

impute_<model>(data, formula, [model-specific options])

impute_rlm: robust linear model
impute_en: ridge/elasticnet/lasso
impute_cart: CART
impute_rf: random forest
impute_rhd: random hot deck
impute_shd: sequential hot deck
impute_knn: k nearest neighbours
impute_mf: missForest
impute_em: mv-normal
impute_const: 用一个固定值插补
impute_lm: linear regression
impute_pmm: Hot-deck imputation
impute_median: 均值插补
impute_proxy: 自定义公式插补，可以用均值等

data是需要插补的数据框

formula指定需要插补的列。

[model-specific options]是根据所选模型不同有不同的参数。

线性回归插补

构建缺失数据

dat <- iris
dat[1:3, 1] <- dat[3:7, 2] <- dat[8:10, 5] <- NA
head(dat, 10)

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
NA	3.5	1.4	0.2	setosa
NA	3.0	1.4	0.2	setosa
NA	NA	1.3	0.2	setosa
4.6	NA	1.5	0.2	setosa
5.0	NA	1.4	0.2	setosa
5.4	NA	1.7	0.4	setosa
4.6	NA	1.4	0.3	setosa
5.0	3.4	1.5	0.2	NA
4.4	2.9	1.4	0.2	NA
4.9	3.1	1.5	0.1	NA

插补

da1 <- impute_lm(dat, Sepal.Length ~ Sepal.Width + Species)
head(da1, 10)

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5.076579	3.5	1.4	0.2	setosa
4.675654	3.0	1.4	0.2	setosa
NA	NA	1.3	0.2	setosa
4.600000	NA	1.5	0.2	setosa
5.000000	NA	1.4	0.2	setosa
5.400000	NA	1.7	0.4	setosa
4.600000	NA	1.4	0.3	setosa
5.000000	3.4	1.5	0.2	NA
4.400000	2.9	1.4	0.2	NA
4.900000	3.1	1.5	0.1	NA

使用中位数进行插补

这里根据Species进行分组后求中位数进行插补

da2 <- impute_median(da1, Sepal.Length ~ Species)
head(da2, 10)

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5.076579	3.5	1.4	0.2	setosa
4.675654	3.0	1.4	0.2	setosa
5.000000	NA	1.3	0.2	setosa
4.600000	NA	1.5	0.2	setosa
5.000000	NA	1.4	0.2	setosa
5.400000	NA	1.7	0.4	setosa
4.600000	NA	1.4	0.3	setosa
5.000000	3.4	1.5	0.2	NA
4.400000	2.9	1.4	0.2	NA
4.900000	3.1	1.5	0.1	NA

使用决策树进行插补

.代表了除Species之外的所有变量

da3 <- impute_cart(da2, Species ~ .)
head(da3,10)

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5.076579	3.5	1.4	0.2	setosa
4.675654	3.0	1.4	0.2	setosa
5.000000	NA	1.3	0.2	setosa
4.600000	NA	1.5	0.2	setosa
5.000000	NA	1.4	0.2	setosa
5.400000	NA	1.7	0.4	setosa
4.600000	NA	1.4	0.3	setosa
5.000000	3.4	1.5	0.2	setosa
4.400000	2.9	1.4	0.2	setosa
4.900000	3.1	1.5	0.1	setosa

使用多重插补

按照顺序分别使用线性回归，中位数，决策树分别进行插补

da4 <- dat %>% 
  impute_lm(Sepal.Length ~ Sepal.Width + Species) %>%
  impute_median(Sepal.Length ~ Species) %>%
  impute_cart(Species ~ .)
head(da4,10)

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5.076579	3.5	1.4	0.2	setosa
4.675654	3.0	1.4	0.2	setosa
5.000000	NA	1.3	0.2	setosa
4.600000	NA	1.5	0.2	setosa
5.000000	NA	1.4	0.2	setosa
5.400000	NA	1.7	0.4	setosa
4.600000	NA	1.4	0.3	setosa
5.000000	3.4	1.5	0.2	setosa
4.400000	2.9	1.4	0.2	setosa
4.900000	3.1	1.5	0.1	setosa

使用固定值进行插补

Sepal.Length所有缺失值插入数值7

da4 <- impute_const(dat, Sepal.Length ~ 7)
head(da4,10)

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
7.0	3.5	1.4	0.2	setosa
7.0	3.0	1.4	0.2	setosa
7.0	NA	1.3	0.2	setosa
4.6	NA	1.5	0.2	setosa
5.0	NA	1.4	0.2	setosa
5.4	NA	1.7	0.4	setosa
4.6	NA	1.4	0.3	setosa
5.0	3.4	1.5	0.2	NA
4.4	2.9	1.4	0.2	NA
4.9	3.1	1.5	0.1	NA

复制其他变量值进行插补

复制Sepal.Width的数值来插补Sepal.Length

da4 <- impute_proxy(dat, Sepal.Length ~ Sepal.Width)
head(da4,10)

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
3.5	3.5	1.4	0.2	setosa
3.0	3.0	1.4	0.2	setosa
NA	NA	1.3	0.2	setosa
4.6	NA	1.5	0.2	setosa
5.0	NA	1.4	0.2	setosa
5.4	NA	1.7	0.4	setosa
4.6	NA	1.4	0.3	setosa
5.0	3.4	1.5	0.2	NA
4.4	2.9	1.4	0.2	NA
4.9	3.1	1.5	0.1	NA

使用稳健线性回归进行插补

相比线性回归更加稳固和保守

da5 <- impute_rlm(dat, Sepal.Length + Sepal.Width ~ Petal.Length + Species)
head(da5)

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
4.945416	3.500000	1.4	0.2	setosa
4.945416	3.000000	1.4	0.2	setosa
4.854057	3.378979	1.3	0.2	setosa
4.600000	3.440107	1.5	0.2	setosa
5.000000	3.409543	1.4	0.2	setosa
5.400000	3.501236	1.7	0.4	setosa

使用均值和残差来插补

以Species来分组，按照线性回归计算残差加上平均值进行插补

da6 <- impute_lm(dat, . - Species ~ 0 + Species, add_residual = "normal") # Species用来分组
head(da6)

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
4.345936	3.500000	1.4	0.2	setosa
4.499524	3.000000	1.4	0.2	setosa
5.647977	3.202251	1.3	0.2	setosa
4.600000	3.611873	1.5	0.2	setosa
5.000000	2.935475	1.4	0.2	setosa
5.400000	3.336509	1.7	0.4	setosa

分组插补

可以在公式中加入|来进行分组插补

dat <- iris
dat[1:3,1] <- dat[3:7,2] <- NA

da8 <- impute_lm(dat, Sepal.Length ~ Petal.Width | Species)
head(da8)

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
4.968092	3.5	1.4	0.2	setosa
4.968092	3.0	1.4	0.2	setosa
4.968092	NA	1.3	0.2	setosa
4.600000	NA	1.5	0.2	setosa
5.000000	NA	1.4	0.2	setosa
5.400000	NA	1.7	0.4	setosa

或使用group_by进行分组后进行插补

dat %>% group_by(Species) %>% 
  impute_lm(Sepal.Length ~ Petal.Width)

使用`impute_proxy`自定义插补方法

自定义一个robust ratio imputation方法进行插补

dat <- impute_proxy(dat, Sepal.Length ~ median(Sepal.Length,na.rm=TRUE)/median(Sepal.Width, na.rm=TRUE) * Sepal.Width | Species)
head(dat)

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5.147059	3.5	1.4	0.2	setosa
4.411765	3.0	1.4	0.2	setosa
NA	NA	1.3	0.2	setosa
4.600000	NA	1.5	0.2	setosa
5.000000	NA	1.4	0.2	setosa
5.400000	NA	1.7	0.4	setosa

使用平均值进行插补

dat <- iris
dat[1:3,1] <- dat[3:7,2] <- NA

dat <- impute_proxy(dat, Sepal.Length ~ mean(Sepal.Length,na.rm=TRUE))
head(dat)

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5.862585	3.5	1.4	0.2	setosa
5.862585	3.0	1.4	0.2	setosa
5.862585	NA	1.3	0.2	setosa
4.600000	NA	1.5	0.2	setosa
5.000000	NA	1.4	0.2	setosa
5.400000	NA	1.7	0.4	setosa

使用分组后平均值进行插补

根据Species分组后再计算平均值插补

dat <- iris
dat[1:3,1] <- dat[3:7,2] <- NA

dat <- impute_proxy(dat, Sepal.Length ~ mean(Sepal.Length,na.rm=TRUE) | Species)
head(dat)

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5.012766	3.5	1.4	0.2	setosa
5.012766	3.0	1.4	0.2	setosa
5.012766	NA	1.3	0.2	setosa
4.600000	NA	1.5	0.2	setosa
5.000000	NA	1.4	0.2	setosa
5.400000	NA	1.7	0.4	setosa

使用其他数据集中训练过的模型插补数据

这里的训练数据集可以是非常多种方法，可以自己定义，比如说逻辑回归，线性回归，决策树，十折交叉验证等等算法；这里采用最简单的线性回归

m <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)

dat <- iris
dat[1:3,1] <- dat[3:7,2] <- NA
head(dat)

dat <- impute(dat, Sepal.Length ~ m)
head(dat)

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5.063856	3.5	1.4	0.2	setosa
4.662076	3.0	1.4	0.2	setosa
NA	NA	1.3	0.2	setosa
4.600000	NA	1.5	0.2	setosa
5.000000	NA	1.4	0.2	setosa
5.400000	NA	1.7	0.4	setosa

实际取得的效果与线性回归插补是一致的！

❝
除此之外还有其他更高级的插补算法，具体可参考VIM package参考文档
❞

致谢

❝
本文内容参考simputation Vignettes文档
❞

❝
希望以上内容有所帮助，喜欢请点赞，转发，赞赏请随意，谢谢支持！
❞

关注下方公众号，分享更多更好玩的R语言知识。

点个在看，SCI马上发表。

一把短刀，怎么就让他连捅18人？！

听纪委朋友说，有的领导干部在被抽掉鞋带和皮带后，一下就崩溃了，甚至个别胆小者顿时大小便失禁……

上海超市血案：背后缘由让人揪心

为啥一线城市只有广州取消限购？是因为穷吗

野村：牛市可能重蹈2015年的崩盘

simputation进行tidy化缺失值插补

加载R包

使用方法

线性回归插补

使用中位数进行插补

使用决策树进行插补

使用多重插补

使用固定值进行插补

复制其他变量值进行插补

使用稳健线性回归进行插补

使用均值和残差来插补

分组插补

使用`impute_proxy`自定义插补方法

使用平均值进行插补

使用分组后平均值进行插补

使用其他数据集中训练过的模型插补数据

致谢

您可能也对以下帖子感兴趣

一把短刀，怎么就让他连捅18人？！

听纪委朋友说，有的领导干部在被抽掉鞋带和皮带后，一下就崩溃了，甚至个别胆小者顿时大小便失禁……

上海超市血案：背后缘由让人揪心

为啥一线城市只有广州取消限购？是因为穷吗

野村：牛市可能重蹈2015年的崩盘

生成图片，分享到微信朋友圈

simputation进行tidy化缺失值插补

加载R包

使用方法

线性回归插补

使用中位数进行插补

使用决策树进行插补

使用多重插补

使用固定值进行插补

复制其他变量值进行插补

使用稳健线性回归进行插补

使用均值和残差来插补

分组插补

使用impute_proxy自定义插补方法

使用平均值进行插补

使用分组后平均值进行插补

使用其他数据集中训练过的模型插补数据

致谢

您可能也对以下帖子感兴趣

使用`impute_proxy`自定义插补方法